LIGHT CSW - Introduction to R

Author

Marc Henrion, Alex Richards

Published

March 1, 2023

License

This document and its contents are licensed under a Creative Commons Attribution 4.0 International License.

Introduction to R

Getting started

R and RStudio IDE

There are 2 main pieces of software that you will need.

  • R is an environment for statistical computing. It is a programming language, specifically a scripting / interpreted language that does not require compilation.
  • RStudio is a company that has developed the most widely used graphical user interface (GUI) / integrated development environment (IDE) for R.

R is free and open source. The latest statistical developments are usually implemented in R first. Unlike commercial packages (SAS, Stata, …), there is little to no quality control for R packages other than feedback from users. Be aware of this!

You will mostly interact with R through RStudio, so familiarise yourself with the interface:

RStudio interface

Resources

R packages

R is stand-alone and you can code any functions you need for your analysis. But why re-invent the wheel when others have already done the work? This is where R packages come in: they are sets of R functions and data that you can load to access specific functions, models etc. In fact, a standard installation of R comes with 15 core and 15 recommended packages:

  • Base: base, compiler, datasets, graphics, grDevices, grid, methods, parallel, splines, stats, stats4, tcltk, tools, translations, utils
  • Recommended: KernSmooth, MASS, Matrix, boot, class, cluster, codetools, foreign, lattice, mgcv, nlme, nnet, rpart, spatial, survival

Packages are hosted either on the Comprehensive R Archive Network (CRAN, https://cran.r-project.org) or on Bioconductor (https://bioconductor.org). CRAN is the more commonly used of the two, with Bioconductor packages mostly targeted at bioinformaticians.

You install packages from CRAN by using the install.packages() function.

For example:

install.packages("tidyverse")
install.packages("lubridate")

And you load packages by using the function library(), for example:

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.1     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(lubridate)

Attaching package: 'lubridate'

The following objects are masked from 'package:base':

    date, intersect, setdiff, union
library(gridExtra)

Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

Installation details for Bioconductor packages are given on the Bioconductor website for each package (see e.g. https://bioconductor.org/packages/release/bioc/html/variancePartition.html).

Reading and writing data

The most basic task for any analysis in R is to read your data into R and, at the end, to write the results to a file on your hard drive.

Let’s start by using some data that R comes with, write it out to disk, then read it back in. The dataset iris is a famous dataset recording 150 observations of flowers from 3 species. It was used extensively by Ronald Fisher as he developed his statistical theory. Find out more about the data by typing ?iris at the R console.

Working directory

R has a working directory: this is the directory on your computer that you are working from. Unless you give full, absolute paths to files, all paths and filenames are interpreted relative to this directory.

You can find out the current working directory with the getwd() function:

getwd()
[1] "C:/work/MLW_LSTM/Trainings/LIGHT_CSW/R_tutorial"

And you can set it using the setwd() command. Note that R requires, even on Windows machines, forward slashes not backward slashes (i.e. "C:/Users/marc/Documents" not "C:\Users\marc\Documents").

setwd("C:/Users/marc/Documents/")

Writing data to disk

We will use iris which is a data frame of 150 rows and 5 columns (variables Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species). If you want to see another way of storing the same data, have a look at iris3 - a 3-dimensional array.

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
dim(iris)
[1] 150   5
table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 

Now, let’s write this as a comma-separated file (we first create a sub-directory to put our files in):

dir.create("dataAndSupportDocs",showWarnings=FALSE) # creates a directory in your current working directory
write.csv(iris,file="dataAndSupportDocs/iris.csv",row.names=FALSE) # saves the data to a csv file

Check on your hard drive, you will now have a file named iris.csv (inside a directory named dataAndSupportDocs) which you can open with a text editor, Excel, …

If you want a bit more control over how the file is written (e.g. different column separator, whether or not to include row and/or column names), you can use the more general write.table() function.

write.table(iris,file="dataAndSupportDocs/iris.tab",sep="\t",row.names=FALSE) # tab separated

Reading data from disk

Let’s suppose now that this file is our dataset and we want to load it into R so that we can work with it. We need to read the content into memory (RAM) and assign it to an object in R that we can then manipulate or feed into different functions for analysis or visualisation. For this we use the function read.csv() and the assignment operator <-.

dat<-read.csv("dataAndSupportDocs/iris.csv")

head(dat)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
dim(dat)
[1] 150   5
table(dat$Species)

    setosa versicolor  virginica 
        50         50         50 

There are alternatives: the more general function read.table(), or the function read_csv() from the R package readr, which is one of a set of packages known as the tidyverse. read_csv() stores the data as a tibble, a special type of data frame.

dat2<-read.table(sep=",",header=T,file="dataAndSupportDocs/iris.csv")

head(dat2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
dim(dat2)
[1] 150   5
table(dat2$Species)

    setosa versicolor  virginica 
        50         50         50 
# library(tidyverse)
# tidyverse / readr package needed, else read_csv() won't be available
dat3<-read_csv("dataAndSupportDocs/iris.csv")
Rows: 150 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Species
dbl (4): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(dat3)
# A tibble: 6 × 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1          5.1         3.5          1.4         0.2 setosa 
2          4.9         3            1.4         0.2 setosa 
3          4.7         3.2          1.3         0.2 setosa 
4          4.6         3.1          1.5         0.2 setosa 
5          5           3.6          1.4         0.2 setosa 
6          5.4         3.9          1.7         0.4 setosa 
dim(dat3)
[1] 150   5
table(dat3$Species)

    setosa versicolor  virginica 
        50         50         50 

Note above the line # library(tidyverse). The hash symbol at the start means this line is a comment: it will not be run and can contain free text. Comments are useful to make your code more readable to others (including your future self!). It is good practice to use comments.

Comments can also be used to skip some lines of code, which is what was done here. This particular comment alerts you to the fact that you need to load the tidyverse (or just readr) package for the function read_csv() to be available.

Assignment operator

Instead of the assignment operator <- that we used above, you can also write = or -> (reversing the left and right hand sides with that last operator), but R coders usually prefer <-. Assignment means that you create an object named dat in your current working memory and assign to it whatever is on the right hand side of the operator (in this case the contents of the file iris.csv).
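
As a minimal sketch (the object name x is arbitrary), all three forms below assign the value 5 to an object:

```r
x <- 5   # the preferred form
x = 5    # also works, but = is easily confused with function arguments
5 -> x   # right assignment: the value on the left goes into the name on the right
print(x) # [1] 5
```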

Working with data

Data & object types

Data types (in general)

A variable has one of 4 levels of measurement:

  • nominal - classifies observations into different categories
    • alive / dead
    • human, fish, goat, bird
  • ordinal - different categories, but categories are ordered
    • low, medium, high
  • interval - ordered data with degree of difference; ratios not meaningful
    • temperature in centigrade: the difference between 10°C and 20°C is the same as between 50°C and 60°C, but 20°C is not twice as hot as 10°C
  • ratio - interval data with a unique, non arbitrary zero value; ratios meaningful
    • temperature in Kelvin
    • length

Data types (in R)

R supports the following data types:

  • Character
a<-c("male","female")
  • Integer
b<-1:4
  • Numeric / double
c<-8/11
d<-pi
  • Logical / boolean
e<-1:10 %% 2 == 0
f<-TRUE
g<-4<2
  • Factor / nominal
h<-factor(sample(a,size=20,replace=T))

Useful functions

Display data type of an object:

typeof()

Check if an object is of a certain data type:

is.character()
is.integer()
is.numeric()
is.logical()
is.factor()

Force data type change for an object:

as.character()
as.integer()
as.numeric()
as.logical()
as.factor()
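
As a small sketch of how these functions fit together (the vector x here is made up for illustration):

```r
x <- c("1", "2", "3")   # a character vector
typeof(x)               # "character"
is.numeric(x)           # FALSE

y <- as.numeric(x)      # coerce to numeric
is.numeric(y)           # TRUE
sum(y)                  # 6 - arithmetic now works

as.numeric("abc")       # NA (with a warning): coercion can fail
```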

Dates

Dates are a special type of data and require special functions to work with. By far the most helpful package is lubridate. It allows you to easily parse dates in many formats, using functions like ymd(), ymd_hms() and ym().

See the lubridate cheat sheet for more information.

ymd("2022-06-04")
[1] "2022-06-04"
ymd("2022 June 4")
[1] "2022-06-04"
ymd_hms("2022-06-04 10:00:00")
[1] "2022-06-04 10:00:00 UTC"
ym("2022-06")
[1] "2022-06-01"
ymd("2022-06-04") - ymd("2022-03-02")
Time difference of 94 days
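
Once parsed, lubridate also makes it easy to extract individual components of a date, e.g. with year(), month(), day() and wday():

```r
library(lubridate)

d <- ymd("2022-06-04")
year(d)                # 2022
month(d)               # 6
day(d)                 # 4
wday(d, label = TRUE)  # the day of the week, as an ordered factor
```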

Object types

  • Vectors
    • A single number or word is just a vector of length 1.
    • You can create vectors with the function c().
    • E.g. v1<-c(1,2,-12.34,5/6)
    • Shorthands for certain types of vectors, e.g. v2<-1:10
    • You can access individual elements of a vector, e.g. 3rd element v1[3]
  • Matrices
    • Matrices hold data tables (rows and columns).
    • All elements of a matrix need to be of the same type (e.g. all numeric or all character).
    • You can create matrices with the function matrix().
    • E.g. m1<-matrix(nrow=2,byrow=TRUE,1:16)
    • You can access individual elements of a matrix, e.g. 2nd row, 3rd column m1[2,3]
  • Arrays
    • Generalise matrices to have any number of dimensions (i.e. more than just 2, e.g. 3x3x2).
    • Not used that commonly.
    • E.g. a1<-table(mtcars$cyl,mtcars$gear,mtcars$am)
    • You can access individual elements of an array, e.g. a1[3,1,2]
  • Lists
    • Lists are very flexible objects that can hold several elements.
    • Each element can be of a different type.
    • Elements can (but don’t need to) be named.
    • E.g. l1<-list(wordVector=c("hello","how","are","you","?"),numberMatrix=matrix(nrow=3,1:9))
    • You can access individual elements, e.g. first element l1[[1]] or by name l1$wordVector
  • Data frames
    • This is the most common object you will use to work with data, e.g. read.csv() returns a data frame.
    • Data frames are like matrices in that they provide a data table (rows and columns) but unlike matrices allow different data types for different columns.
    • Technically data frames are special types of lists. This means that you can access individual columns / variables by using the $ notation.
    • E.g. df1<-data.frame(pid=1:5,name=c("Chimwemwe","Esther","John","Dalitso","Julie")) or object dat from earlier.
    • You can access individual variables, e.g. df1$name or df1[,"name"] or df1[,2].
    • You can access individual rows, e.g. df1[1,].
    • You can access individual elements, e.g. df1[3,2] or df1$name[3]
  • Tibbles
    • Tibbles are data frames, but have special methods for printing, accessing and working with them.
    • Tibbles are used by the tidyverse packages.
    • E.g. read_csv() returns a tibble.
    • Use tibbles just like data frames.
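
The indexing rules above can be sketched using the example objects from the bullets:

```r
v1 <- c(1, 2, -12.34, 5/6)                  # a vector
m1 <- matrix(nrow = 2, byrow = TRUE, 1:16)  # a 2 x 8 matrix
l1 <- list(wordVector = c("hello", "how", "are", "you", "?"),
           numberMatrix = matrix(nrow = 3, 1:9))  # a named list
df1 <- data.frame(pid = 1:5, name = c("Chimwemwe", "Esther", "John", "Dalitso", "Julie"))

v1[3]          # -12.34: third element of the vector
m1[2, 3]       # 11: second row, third column
l1$wordVector  # the list element named "wordVector"
df1$name[2]    # "Esther": second value of the name variable
```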

The rules of tidy data:

  • 1 row = 1 observation
  • 1 column = 1 variable
  • 1 cell = 1 value.

Handling data

Filter (observations) & select (variables)

Often you only want to work with a subset of your data. This may be because you are interested only in one group of observations, or because you want to get rid of irrelevant variables. The dplyr package provides 2 easy-to-use functions for this:

  • filter() allows filtering out only certain observations
  • select() allows selecting specific variables

For example:

filter(.data=iris,Species=="setosa") # returns only the observations where the Species variable is set to "setosa"
select(.data=iris,Petal.Width,Petal.Length) # returns only the Petal.Width and Petal.Length variables from the iris data frame

Often you want to combine these operations. For this, pipes are useful - see below.

Pipe operators

The tidyverse package magrittr introduced a pipe operator, %>%, to R. You may be familiar with pipes if you have done any shell scripting before (e.g. bash or zsh, where the pipe operator is |). Pipes let you pass the output of one function as input to another function. This allows for more concise code.

Since R v4.1.0 there is also a native pipe operator in R: |>.

For example:

iris %>%
  filter(Species=="setosa") %>%
  select(Petal.Width,Petal.Length) %>%
  head(n=5)
  Petal.Width Petal.Length
1         0.2          1.4
2         0.2          1.4
3         0.2          1.3
4         0.2          1.5
5         0.2          1.4
iris |>
  filter(Species=="setosa") |>
  select(Petal.Width,Petal.Length) |>
  head(n=5)
  Petal.Width Petal.Length
1         0.2          1.4
2         0.2          1.4
3         0.2          1.3
4         0.2          1.5
5         0.2          1.4

Creating new variables

To create new variables from existing ones, you can use the function mutate() (add new variables and preserve existing ones) and transmute() (add new variables but drop existing ones) from the dplyr package (part of the tidyverse).

irisNew <- iris %>%
  mutate(
    LRatio=Sepal.Length/Petal.Length,
    WRatio=Sepal.Width/Petal.Width
  )
head(irisNew)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   LRatio WRatio
1          5.1         3.5          1.4         0.2  setosa 3.642857  17.50
2          4.9         3.0          1.4         0.2  setosa 3.500000  15.00
3          4.7         3.2          1.3         0.2  setosa 3.615385  16.00
4          4.6         3.1          1.5         0.2  setosa 3.066667  15.50
5          5.0         3.6          1.4         0.2  setosa 3.571429  18.00
6          5.4         3.9          1.7         0.4  setosa 3.176471   9.75
irisNew2 <- iris %>%
  transmute(
    LRatio=Sepal.Length/Petal.Length,
    WRatio=Sepal.Width/Petal.Width
  )

head(irisNew2)
    LRatio WRatio
1 3.642857  17.50
2 3.500000  15.00
3 3.615385  16.00
4 3.066667  15.50
5 3.571429  18.00
6 3.176471   9.75

This can also be useful to recode categorical variables; e.g. suppose we wish to change the Species variable in the iris dataset so that virginica and versicolor are combined into a single value.

irisNew3 <- iris %>%
  mutate(
    Species=fct_recode(Species,setosa="setosa",virginica_versicolor="virginica",virginica_versicolor="versicolor")
  )

table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 
table(irisNew3$Species)

              setosa virginica_versicolor 
                  50                  100 

Another useful situation may be when a binary or dichotomous variable is coded as 1/2 (here we assume 1=success and 2=failure) but you want it coded 0/1 (0=failure, 1=success).

# first we create a data frame with an ID variable and a dichotomous variable, binVar, coded as 1/2
# then we use the mutate() function to create a new variable called binVar01 which will have the recoded values for variable binVar - now coded as 0/1
# the actual conversion is done using the function case_when()
df<-data.frame(
  ID=1:10,
  binVar=sample(1:2,replace=T,size=10,prob=c(0.2,0.8))
) %>%
  mutate(
    binVar01=case_when(
      binVar==2~0,
      binVar==1~1
    )
  )

table(df$binVar)

1 2 
3 7 
table(df$binVar01)

0 1 
7 3 

Join / merge data tables

At a very basic level you can use the commands rbind() and cbind() to join vectors / matrices / data frames by rows (rbind) or columns (cbind). This requires the objects being combined to have the same number and order of columns (for rbind()) or rows (for cbind()).

irisRatios<-data.frame(
  LRatio=iris$Sepal.Length/iris$Petal.Length,
  WRatio=iris$Sepal.Width/iris$Petal.Width
)

irisNew3<-cbind(iris,irisRatios)

head(irisNew3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   LRatio WRatio
1          5.1         3.5          1.4         0.2  setosa 3.642857  17.50
2          4.9         3.0          1.4         0.2  setosa 3.500000  15.00
3          4.7         3.2          1.3         0.2  setosa 3.615385  16.00
4          4.6         3.1          1.5         0.2  setosa 3.066667  15.50
5          5.0         3.6          1.4         0.2  setosa 3.571429  18.00
6          5.4         3.9          1.7         0.4  setosa 3.176471   9.75
newMatrix<-rbind(1:3,4:6)
head(newMatrix)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

However, in many situations we want to join tables of different dimensions, extracting information from one data frame and adding it to another. For example, you may have one data frame containing individual-level data on patients, including which drug they were given, and a second data frame containing data on the drugs themselves, e.g. their active ingredients. You may want to add the drug data to the individual-level data based on which drug each patient received.

For such operations, joins are the tool of choice. There are many ways to join 2 data frames, depending on how you want to add the information. The graphic below summarises the different joins: left joins (almost always what you want to do), right joins, full joins and inner joins.

You can use the dplyr package functions left_join(), right_join(), full_join() and inner_join() to do such join operations.

Wickham, H. & Grolemund G., R for Data Science, O’Reilly, 2016

# Patient-level data
pDat<-data.frame(
  name=c("Marc","Ulemu","Clemens","Brigitte"),
  age=c(37,48,46,38),
  drug=c("Lariam","Lariam","Malarone","none")
)

# Drug-level data
dDat<-data.frame(
  commonName=c("Acticlate","Aralen","Jasoprim","Lariam","Malarone","Malirid","Monodox"),
  ingredient=c("Doxycycline","Chloroquine","Primaquine","Mefloquine","Atavaquone/Proguanil","Primaquine","Doxycyline")
)

# Left join
pDatWithDrug<-left_join(pDat,dDat,by=c("drug" = "commonName"))

head(pDat)
      name age     drug
1     Marc  37   Lariam
2    Ulemu  48   Lariam
3  Clemens  46 Malarone
4 Brigitte  38     none
head(dDat)
  commonName           ingredient
1  Acticlate          Doxycycline
2     Aralen          Chloroquine
3   Jasoprim           Primaquine
4     Lariam           Mefloquine
5   Malarone Atavaquone/Proguanil
6    Malirid           Primaquine
head(pDatWithDrug)
      name age     drug           ingredient
1     Marc  37   Lariam           Mefloquine
2    Ulemu  48   Lariam           Mefloquine
3  Clemens  46 Malarone Atavaquone/Proguanil
4 Brigitte  38     none                 <NA>

Pivot data (wide → long and long → wide)

When data are collected with repeated measurements on the same unit over time or across different conditions (e.g. longitudinal patient data, or before / after data on the same patients), the data can be either in a wide format or a long format.

Depending on what analysis is going to be done, the data may need to be in either of these 2 formats. It is important to be able to switch between both formats when needed. The package tidyr provides 2 useful functions for this: pivot_longer() and pivot_wider().

Their use is best illustrated with an example.

datWide <- read.table(header=TRUE, text='
                      subject sex control cond1 cond2
                      1       M       7.9  12.3  10.7
                      2       F       6.3  10.6  11.1
                      3       F       9.5  13.1  13.8
                      4       M      11.5  13.4  12.9
                      ')
datWide$subject <- factor(datWide$subject)

print(datWide)
  subject sex control cond1 cond2
1       1   M     7.9  12.3  10.7
2       2   F     6.3  10.6  11.1
3       3   F     9.5  13.1  13.8
4       4   M    11.5  13.4  12.9

This is the wide format: 1 individual per row, multiple observations of the same variable under different conditions or fixed timepoints.

Wide to long: pivot_longer()

datLong<-pivot_longer(datWide,cols=c(control,cond1,cond2),names_to="condition",values_to="measurement")
print(datLong)
## # A tibble: 12 × 4
##    subject sex   condition measurement
##    <fct>   <chr> <chr>           <dbl>
##  1 1       M     control           7.9
##  2 1       M     cond1            12.3
##  3 1       M     cond2            10.7
##  4 2       F     control           6.3
##  5 2       F     cond1            10.6
##  6 2       F     cond2            11.1
##  7 3       F     control           9.5
##  8 3       F     cond1            13.1
##  9 3       F     cond2            13.8
## 10 4       M     control          11.5
## 11 4       M     cond1            13.4
## 12 4       M     cond2            12.9

Long to wide: pivot_wider()

pivot_wider(data=datLong,names_from=condition,values_from=measurement)
## # A tibble: 4 × 5
##   subject sex   control cond1 cond2
##   <fct>   <chr>   <dbl> <dbl> <dbl>
## 1 1       M         7.9  12.3  10.7
## 2 2       F         6.3  10.6  11.1
## 3 3       F         9.5  13.1  13.8
## 4 4       M        11.5  13.4  12.9

Notes:

  • If you have longitudinal data with variable time points, you can only use the long format.
  • The base R package stats also provides a function that allows switching between long and wide formats: reshape(). You can explore this function by yourself (type ?reshape at the R console to open the documentation page for it), but generally the use of pivot_longer() and pivot_wider() is more intuitive.

Coding

R scripts

When you develop code for an analysis, this will often be a lengthy set of R commands. To keep this code accessible as you develop it, it is best to save it in a script file.

In RStudio, click on File, then New File and then R Script. In the script editor a new, blank script will open. This is just a text file that can hold your R code. You can save this to disk like any other file. You could use a .txt extension, but typically R scripts are saved as .R files.

In general:

  • Experiment with code on the console.
  • Save the code you want to keep in a script file using the script editor.

R functions

When you need to run a given set of commands over and over again, it may be more efficient to group them in an R function - this saves you from writing repetitive code.

To define a function you use the R function function(). Inside its parentheses you specify the input arguments of the function you are defining. The body of code for the function follows in curly brackets after the call to function().

Below is an example where we define a new function, summaryFunction(), that computes the mean and standard deviation of a specific variable in a data frame and returns them as a list of length 2.

summaryFunction<-function(dat,var){
  # dat = a data frame
  # var = a character string indicating the name of the variable in the dat data frame that is to be summarised
  
  m<-mean(dat[,var])
  s<-sd(dat[,var])
  
  return(list(mean=m,sd=s))
}

summaryFunction(dat=iris,var="Sepal.Width")
$mean
[1] 3.057333

$sd
[1] 0.4358663

Iterations and control statements: for loops and if statements

A key concept in most programming languages is the mechanism to iterate a set of instructions and to control which parts of the code are executed. R offers many such constructs, but we will focus on 2 key ones here:

  • for loops - to iterate a section of code over a vector, data frame or list

  • if statements - to write sections of code that are only run when certain condition(s) are met

# create an empty vector of the desired size
flowerSize<-rep(NA,nrow(iris))

# iterate over all rows of the iris data frame
for(i in 1:nrow(iris)){
  flowerSize[i]<-iris$Petal.Length[i]+iris$Petal.Width[i]+iris$Sepal.Length[i]+iris$Sepal.Width[i]
}

irisNew4<-cbind(iris,flowerSize)

head(irisNew4)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species flowerSize
1          5.1         3.5          1.4         0.2  setosa       10.2
2          4.9         3.0          1.4         0.2  setosa        9.5
3          4.7         3.2          1.3         0.2  setosa        9.4
4          4.6         3.1          1.5         0.2  setosa        9.4
5          5.0         3.6          1.4         0.2  setosa       10.2
6          5.4         3.9          1.7         0.4  setosa       11.4
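
To illustrate the if statement, we can extend the loop above: suppose we want to label each flower as "large" or "small" depending on whether its flowerSize exceeds 15 (a sketch; the threshold of 15 is arbitrary).

```r
# recompute flowerSize (vectorised here for brevity)
flowerSize <- iris$Petal.Length + iris$Petal.Width + iris$Sepal.Length + iris$Sepal.Width

sizeClass <- rep(NA, length(flowerSize))
for (i in seq_along(flowerSize)) {
  if (flowerSize[i] > 15) {   # condition: run the next block only if TRUE
    sizeClass[i] <- "large"
  } else {                    # optional else branch: run if the condition is FALSE
    sizeClass[i] <- "small"
  }
}

table(sizeClass)
```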

for loops are slow in R. Whenever possible, attempt to vectorise your operations.

# what we've just done
forLoop<-function(){
  flowerSize<-rep(NA,nrow(iris))
  
  for(i in 1:nrow(iris)){
    flowerSize[i]<-iris$Petal.Length[i]+iris$Petal.Width[i]+iris$Sepal.Length[i]+iris$Sepal.Width[i]
  }
}

# same operation but vectorised
vectorOperation<-function(){
  flowerSize<-iris$Petal.Length+iris$Petal.Width+iris$Sepal.Length+iris$Sepal.Width
}

system.time(forLoop())
   user  system elapsed 
      0       0       0 
system.time(vectorOperation())
   user  system elapsed 
      0       0       0 

Parameter estimation / summary statistics (primer)

Proportions

The function binom.test() allows you to easily compute proportion estimates and exact binomial 95% confidence intervals:

propEst<-binom.test(x=sum(dat$Species=="setosa"),n=nrow(dat))
propEst$estimate
probability of success 
             0.3333333 
propEst$conf.int
[1] 0.2585564 0.4148430
attr(,"conf.level")
[1] 0.95

Formatting this nicely and expressing it as percentages:

propNice<-paste(sep="",
                format(nsmall=1,round(digits=1,100*propEst$estimate)),
                "% (",
                format(nsmall=1,round(digits=1,100*propEst$conf.int[1])),
                "%, ",
                format(nsmall=1,round(digits=1,100*propEst$conf.int[2])),
                "%)")

cat(propNice,file="")
33.3% (25.9%, 41.5%)

Means, standard deviations:

To compute an arithmetic mean, use the R function mean(); for standard deviations, use sd().

mu<-mean(dat$Petal.Length,na.rm=TRUE)
sigma<-sd(dat$Petal.Length,na.rm=TRUE)
sigma2<-sigma^2 # same as sigma2<-var(dat$Petal.Length,na.rm=TRUE)
n<-sum(!is.na(dat$Petal.Length))

ci<-c(
  mu - qnorm(0.975)*sigma/sqrt(n),
  mu + qnorm(0.975)*sigma/sqrt(n)
)

print(mu)
[1] 3.758
print(sigma)
[1] 1.765298
print(sigma^2)
[1] 3.116278
print(ci)
[1] 3.475499 4.040501

Medians, interquartile ranges:

To compute a median, use the function median(); for IQRs, use quantile(). You can also use the function IQR() for the latter, but note that it returns the width of the IQR rather than its endpoints.

me<-median(dat$Petal.Length,na.rm=TRUE)
iqr<-quantile(dat$Petal.Length,probs=c(0.25,0.75),na.rm=TRUE)

print(me)
[1] 4.35
print(iqr)
25% 75% 
1.6 5.1 
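
The base R function IQR() (from the stats package) returns the width of the interquartile range rather than its two endpoints (using iris directly here, which is identical to dat above):

```r
# same as diff(quantile(iris$Petal.Length, probs = c(0.25, 0.75)))
IQR(iris$Petal.Length, na.rm = TRUE)  # width of the IQR: 5.1 - 1.6 = 3.5
```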

Grouping & summarising

Often it is easiest to summarise a dataset when you group observations. The dplyr functions group_by() and summarise() are useful here.

dat %>%
  group_by(Species) %>%
  summarise(
    mean=mean(Petal.Length),
    median=median(Petal.Length),
    sd=sd(Petal.Length),
    q25=quantile(Petal.Length,probs=0.25),
    q75=quantile(Petal.Length,probs=0.75)
    ,.groups="drop")
# A tibble: 3 × 6
  Species     mean median    sd   q25   q75
  <chr>      <dbl>  <dbl> <dbl> <dbl> <dbl>
1 setosa      1.46   1.5  0.174   1.4  1.58
2 versicolor  4.26   4.35 0.470   4    4.6 
3 virginica   5.55   5.55 0.552   5.1  5.88

Basic figures & graphs

Overview

R has powerful data visualisation capabilities, especially when you start using packages.

The standard R visualisation library is graphics, which is installed when you install R and loaded automatically at start-up. The main function here is plot().

However, the graphics package is quite basic and has inconsistent syntax and function names for different types of graphs.

The tidyverse comes with the R package ggplot2 which provides a more consistent way of producing graphs (it is based on a grammar of graphics).

For example, let’s do a scatter plot:

dat %>%
  ggplot(mapping=aes(x=Petal.Length,y=Petal.Width,col=Species,pch=Species)) +
  geom_point() +
  scale_colour_manual(values=c("steelblue","orange","salmon"),name="Species") +
  scale_shape_manual(values=15:17,name="Species") +
  xlab("Petal length") +
  ylab("Petal width") +
  labs(title="Scatterplot of petal length and width in the iris dataset",caption="Figure produced during the KUHeS CRM R training, using ggplot2.") +
  theme_light()

The aesthetic (the aes() call in the above) defines what you will see, i.e. the data you will plot. The geom will define the geometry, i.e. how you will plot the data.

The basics behind ggplot2 (https://github.com/rstudio/cheatsheets/raw/master/data-visualization.pdf).

It would not make much sense, but we could decide to plot the same data not as a scatterplot but as a line graph. All we need to do is to change the geom from geom_point() to geom_line().

dat %>%
  ggplot(mapping=aes(x=Petal.Length,y=Petal.Width,col=Species,pch=Species)) +
  geom_line() +
  scale_colour_manual(values=c("steelblue","orange","salmon"),name="Species") +
  scale_shape_manual(values=15:17,name="Species") +
  xlab("Petal length") +
  ylab("Petal width") +
  labs(title="Line graph of petal length and width in the iris dataset",caption="Figure produced during the KUHeS CRM R training, using ggplot2.") +
  theme_light()

With ggplot2, very different graphs require very similar code. You can also use more than one geom:

dat %>%
  ggplot(mapping=aes(x=Species,y=Petal.Width,col=Species)) +
  geom_boxplot() +
  geom_jitter(height=0,width=0.2) +
  scale_colour_manual(values=c("steelblue","orange","salmon")) +
  scale_fill_manual(values=c("steelblue","orange","salmon")) +
  xlab("Species") +
  ylab("Petal width") +
  labs(title="Box and jitter plot of petal width against species in the iris dataset",caption="Figure produced during the MATSURV R training, using ggplot2.") 

Resources

  • The R Graph Gallery is an excellent resource to get started when you have a specific plot in mind that you want to do for your data.

  • The ggplot2 cheat sheet is another excellent resource.

Distribution figures

Distributions: what values did you record for a given variable?

  • Discrete values (nominal, ordinal, integer-valued):
    • bar plots
    • pie charts
  • Continuous values
    • histograms

Let’s generate some data:

typesTmp<- paste(sep="","type",1:3)
type<-factor(sample(typesTmp,
                    size=1000,
                    replace=T,
                    prob=c(0.45,0.3,0.25)))
x1<-rbinom(1000,size=1,prob=0.25)
x2<-rpois(1000,lambda=ifelse(type=="type3",6,4))

dat<-data.frame(type,x1,x2) %>%
  mutate(x3=ifelse(type=="type1",
                   rnorm(sum(type=="type1"),mean=-2),
                   ifelse(type=="type2",
                          rnorm(sum(type=="type2"),mean=2),
                          runif(sum(type=="type3"))
                          )
                   )
         ) %>%
  mutate(x4=rnorm(n(),mean=x3))

Distribution: bar plots

Bar plots simply show bars whose heights are proportional to the relative frequencies of the different levels of a categorical variable in your dataset. In ggplot2, the geom to use for bar plots is geom_bar().

Simple bar plots for 3 variables.

ggplot(data=dat,mapping=aes(x=x1)) +
  geom_bar()

ggplot(data=dat,mapping=aes(x=x2)) +
  geom_bar()

ggplot(data=dat,mapping=aes(x=type)) +
  geom_bar()

Stratifying a bar plot by a grouping variable (in this case the variable type).

dat %>%
  count(type,x2) %>%
  complete(type,x2,fill=list(n=0)) %>%
  ggplot(mapping=aes(fill=type,y=n,x=x2)) +
  geom_bar(position="dodge",stat="identity") +
  theme(text = element_text(size=20))

You can customise graphs by adding other options for the coordinate system, the annotations, the theme etc.

ggplot(data=dat,mapping=aes(x=x2)) + 
  geom_bar() +
  coord_flip() +
  ggtitle("Barplot for variable x2") +
  xlab("values") +
  ylab("count") +
  theme(text = element_text(size=20))

You can change the coordinate system to get a circular plot (note, use these carefully - most often they are NOT a good idea).

For this you can use coord_polar().

Getting annotations for each bar requires a bit more work (but see the R Graph Gallery for examples).

ggplot(data=dat,mapping=aes(x=x2)) +
  geom_bar() + coord_polar(start=0) + theme_minimal() +
  theme(
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    plot.margin = unit(rep(-2,4), "cm") # removing unnecessary margins
  ) +
  ylim(-100,250) # determines the size of the inner radius

Distribution: pie charts

In ggplot2, pie charts are just bar charts with a special (polar) coordinate system.

ggplot(dat, mapping=aes(x=factor(1), fill=factor(type))) +
  geom_bar(width = 1) +
  coord_polar("y") + 
  xlab("") +
  theme_void()

Note:

Generally, the use of pie charts is discouraged (they are easily misleading - human brains are not good at judging the relative sizes of areas).

Distribution: histograms

  • Summarise the distribution of continuous variables, by binning data into discrete intervals.
  • Can be used for binary and discrete variables as well, but bar plots are better suited for such variables.

In ggplot2, the geom_histogram() is used to generate histograms.

ggplot(data=dat,mapping=aes(x=x3)) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note the warning message above. You can ignore it; R is just telling you that other bin widths / numbers of bins may work better. You can set this by specifying the bins or the binwidth parameter.

ggplot(data=dat,mapping=aes(x=x3)) +
  geom_histogram(binwidth=0.1)
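Alternatively, instead of the bin width you can specify the number of bins directly via the bins parameter (50 here is an arbitrary illustrative choice):

```r
ggplot(data=dat,mapping=aes(x=x3)) +
  geom_histogram(bins=50)
```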

ggplot(data=dat,mapping=aes(x=x3,y=after_stat(density))) +
  geom_histogram(binwidth=0.15)

Note: in older code you may see stat(density), which was deprecated in ggplot2 3.4.0; use after_stat(density) to plot densities rather than counts on the y-axis.

There are many other parameters you can specify to customise the histogram.

ggplot(data=dat,mapping=aes(x=x3,y=after_stat(density))) +
  geom_histogram(binwidth=0.15) +
  coord_cartesian(xlim = c(-10, 10)) +
  ggtitle("Histogram") +
  theme_light() +
  theme(text=element_text(size=24))

We can add kernel density estimates for the distribution function.

ggplot(data=dat,mapping=aes(x=x3,y=after_stat(density))) +
  geom_histogram(binwidth=0.15) +
  geom_density(bw="SJ",col="blue",lwd=1)

We can stratify by a grouping variable (variable type in this case).

ggplot(data=dat,mapping=aes(x=x3,fill=type)) +
  geom_histogram(binwidth=0.15,position="dodge")

You can change the default colours used for the type variable. The aesthetic governing the colour of the bars is fill (here mapped via fill=type), so to change the colours we need to use the function scale_fill_manual().

ggplot(data=dat,mapping=aes(x=x3,fill=type)) +
  geom_histogram(binwidth=0.15,position="dodge") +
  scale_fill_manual(values=c("steelblue","orange","salmon"))

Covariation figures

Covariation figures let you visually explore whether there is a relationship between 2 or more variables.

Covariation: box & whisker, violin plots

Boxplots, and their variation violin plots, allow you to explore what, if any, relationship there is between a binary / discrete / categorical variable and a continuous variable. For boxplots, the geom geom_boxplot() is used.

ggplot(data=dat,mapping=aes(x=type,y=x3)) +
  geom_boxplot()

Often it is nice to add the individual data points on top. To better display these (since they all have the same x-axis coordinate), you need to add jitter to their x-axis coordinates. ggplot2 provides a geom for this: geom_jitter(). By specifying some transparency (the alpha parameter), it is easier to see where the data density is highest.

ggplot(data=dat,mapping=aes(x=type,y=x3)) +
  geom_boxplot() +
  geom_jitter(width=0.25,height=0,alpha=0.3) # height = 0 as we only want to jitter the x-axis values, not the y-axis values

You can also generate horizontal boxplots, by flipping the coordinate system.

ggplot(data=dat,mapping=aes(x=type,y=x3)) +
  geom_boxplot() +
  coord_flip()

It is important that you understand what information is summarised in a boxplot:

Wickham, H. & Grolemund G., R for Data Science, O’Reilly, 2016

Violin plots are similar to boxplots, but they combine box plots with mirrored kernel density estimates. In ggplot2, the geom geom_violin() is used to generate these types of graphs.

ggplot(data=dat,mapping=aes(x=type,y=x3,fill=type)) +
  geom_violin() +
  geom_boxplot(width=0.05, fill="white")

Covariation: scatter plots

The scatter plot, which shows 2 continuous variables plotted against each other, is probably the most basic and most used covariation plot. For example, it is the basis of any graph showing a regression model.

The geom geom_point() is used to generate scatter plots in ggplot2.

ggplot(data=dat,mapping=aes(x=x3,y=x4)) +
  geom_point()

This can be customised and you can easily add smooth regression lines with geom_smooth().

Note that the colour of the individual points is given by the col argument and so scale_colour_manual() is used to specify colours (rather than scale_fill_manual() as was used above in the example for histograms).

ggplot(data=dat,mapping=aes(x=x3,y=x4,col=type)) +
  geom_point() +
  geom_smooth() +
  scale_colour_manual(values=c("steelblue","salmon","orange"),
                      name="Type") +
  xlab("Variable x3") +
  ylab("Variable x4") +
  ggtitle("A covariation plot.") +
  theme_light() +
  theme(text=element_text(size=14))
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Covariation: line & time plots

If one of the 2 variables has an order to it, such as a time variable, then line graphs may be better suited. The geom geom_line() is used for this.

For an example, we will use data that come with R: body temperature measurements from 2 beavers, stored in the data frames beaver1 and beaver2.

beaver<-rbind(beaver1[beaver1$day==346,],beaver2[beaver2$day==307,])
beaver<-data.frame(
  name=c(rep("beaver1",sum(beaver1$day==346)),
         rep("beaver2",sum(beaver2$day==307))),beaver)

For a simple line graph, you just need an x and a y variable.

beaver %>%
  filter(name=="beaver1") %>%
  ggplot(mapping=aes(x=time,y=temp)) +
    geom_line()

You can easily stratify a line graph by a grouping variable (in this case the variable name) in ggplot2.

ggplot(data=beaver,mapping=aes(x=time,y=temp,colour=name)) +
  geom_line(lwd=1.5) # the lwd parameter sets the line width

Covariation: heat maps and contour plots

Heatmaps are useful to show covariation among 3 variables. The x- and y-axis show 2 of the variables and the third variable determines the colour of cells. The variables can be either categorical or continuous.

In ggplot2, one geom that you can use for heatmaps is geom_tile().

First, we need some data.

dens<-function(x,y){
  return(
    0.35*dnorm(x)*dnorm(y,sd=1.5) + 
    0.65*dnorm(x,mean=2,sd=2)*dnorm(y,mean=3)
  )
}

x<-seq(-2.5,6.5,by=0.05)
y<-seq(-3,5.5,by=0.05)

densSurf<-expand.grid(x=x,y=y) %>%
  mutate(dens=dens(x,y))

Then we can do a basic heat map, using default colours.

densSurf %>%
  ggplot(mapping=aes(x=x,y=y,fill=dens)) +
  geom_tile(width=0.05,height=0.05)

But we can also provide custom colour scales.

clrs<-colorRampPalette(c("blue","red","orange","yellow","white"))

densSurf %>%
  ggplot(mapping=aes(x=x,y=y,fill=dens)) +
  geom_tile(width=0.05,height=0.05) +
  scale_fill_gradientn(colours = clrs(200),name="probability density") +
  theme_minimal()

Contour plots are similar but use contour lines rather than colour scales to show variation in the third variable.

ggplot(data=densSurf,mapping=aes(x=x,y=y,z=dens)) +
  geom_contour()

You can combine heat maps and contour plots.

densSurf %>%
  ggplot(mapping=aes(x=x,y=y,fill=dens,z=dens)) +
  geom_tile(width=0.05,height=0.05) +
  geom_contour(col="darkgrey",lwd=0.35,alpha=0.75) +
  scale_fill_gradientn(colours = clrs(200),name="probability density") +
  theme_minimal()
Warning: The following aesthetics were dropped during statistical transformation: fill
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?

This warning is harmless here: geom_contour() simply does not use the fill aesthetic.

Maps

You can plot geographical location data like any other type of data.

However, loading map data, dealing with projections of 3-dimensional spherical coordinates onto a 2-dimensional plane and plotting underlying geographical features are not trivial and beyond the scope of this introduction to R (in general, it is highly recommended you attend a GIS training for such applications).

Multi-panel figures

There are multiple ways to generate multi-panel figures. Here we just cover 2 examples.

For a basic, but general, way of combining multiple graphs into the same figure, the function grid.arrange() from the R package gridExtra can be used.

The procedure is very simple:

  • Load the package (if not loaded already): library(gridExtra)
  • Assign each ggplot() call to an object (e.g. p1<-ggplot(...) + ...)
  • Produce a multi-panel figure with grid.arrange(p1,p2,...,nrow=2)

The nrow parameter specifies over how many rows you want to arrange the plots.

g1<-ggplot(data=dat,mapping=aes(x=type,y=x3)) +
  geom_boxplot() +
  geom_jitter(width=0.25,height=0,alpha=0.3) +
  xlab("") +
  ylab("Variable x3") +
  ggtitle("A box & jitter plot.") +
  theme_minimal()

g2<-ggplot(data=dat,mapping=aes(x=x3,y=x4,col=type)) +
  geom_point() +
  geom_smooth() +
  scale_colour_manual(values=c("steelblue","salmon","orange"),
                      name="Type") +
  xlab("Variable x3") +
  ylab("Variable x4") +
  ggtitle("A covariation plot.") +
  theme_light() +
  theme(text=element_text(size=14))

grid.arrange(g1,g2,nrow=1)

The other approach we will cover is for when you want to repeat the same graph for different groups of observations within your dataset. This is called faceting and in ggplot2 the function facet_wrap() allows you to do this very easily.

iris %>%
  ggplot(mapping=aes(x=Petal.Length,y=Petal.Width)) +
  geom_point() +
  facet_wrap(~Species)

Saving graphs to files

In ggplot2, the simplest way to save graphs is to use the function ggsave().

ggsave(g1,filename="myplot.png",width=16,height=9,units="cm",dpi=450)

Another way is to enclose your plotting code inside a statement that specifies a graphics device. “Graphics device” means a pdf, png, jpg or other file.

Example:

pdf(width=16,height=9,file="myfile.pdf") # opens the device; pdf will produce vector graphics
# plotting code here
dev.off() # closes the device

Or:

png(file="myfile.png",width=16,height=9,units="in",res=450) # raster graphics; note that a file name must be specified
# plotting code here
dev.off()

Basic statistical analysis

Common statistical tests

In this section we will provide a very quick introduction to how to do common statistical tests in R. We will not cover the theory and assumptions behind those tests - you will cover this later in the week in lectures. We exclusively focus on the R commands to do these tests.

Specifically we will look at:

  • Two-sample t-test.
  • Wilcoxon rank-sum test.
  • One-way ANOVA.
  • Kruskal-Wallis test.
  • Two-sample test for proportions.

Two-sample t-test.

The two-sample t-test compares the sample means of a given variable across 2 groups of observations. It tests the null hypothesis of no difference in means against the alternative that the means in the 2 groups differ. The two-sample t-test assumes that the sample means are normally distributed.

In R, you use the command t.test() to perform a t-test. You can either specify directly 2 vectors of observations or you can use the formula syntax in R.

As an example, let’s compare the sample means of the variable Sepal.Width between the versicolor and virginica species of flowers in the iris dataset.

# specify 2 vectors as input arguments x and y
vec1<-iris$Sepal.Width[iris$Species=="versicolor"]
vec2<-iris$Sepal.Width[iris$Species=="virginica"]
t.test(x=vec1,y=vec2)

    Welch Two Sample t-test

data:  vec1 and vec2
t = -3.2058, df = 97.927, p-value = 0.001819
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.33028364 -0.07771636
sample estimates:
mean of x mean of y 
    2.770     2.974 

The p-value is 0.0018, which is less than 0.05. So there is enough evidence, at the 5% significance threshold, to reject the null hypothesis that the 2 flower species have the same mean sepal width.

We can also use the formula syntax to do the same test. In this syntax you specify a formula of the format y ~ x where y is the response variable and x is the predictor variable. You then also need to specify a data frame that holds the data for the 2 specified variables.

t.test(Sepal.Width~Species,data=iris %>% filter(Species!="setosa"))

    Welch Two Sample t-test

data:  Sepal.Width by Species
t = -3.2058, df = 97.927, p-value = 0.001819
alternative hypothesis: true difference in means between group versicolor and group virginica is not equal to 0
95 percent confidence interval:
 -0.33028364 -0.07771636
sample estimates:
mean in group versicolor  mean in group virginica 
                   2.770                    2.974 

The function t.test() can also be used for one-sample t-tests as well as paired t-tests. To find out how to do this, start by reading the manual page for the function by typing ?t.test at the console.
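For instance, a one-sample t-test (testing whether the mean of vec1 equals some hypothesised value; the value 3 below is chosen purely for illustration) and a paired t-test could look as follows:

```r
# one-sample t-test: H0 is that the mean of vec1 equals 3
t.test(x=vec1,mu=3)

# paired t-test: compares paired measurements on the same flowers
t.test(x=iris$Petal.Length,y=iris$Petal.Width,paired=TRUE)
```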

Wilcoxon rank-sum test

The Wilcoxon rank-sum test, also called the two-sample Wilcoxon test or the Mann-Whitney U test, is a non-parametric alternative to the two-sample t-test. Rather than comparing the sample means, this test actually compares the entire distributions of a variable in 2 groups. It is most sensitive to changes in the median of the distribution and so this test is often interpreted as testing the null hypothesis of equal medians in two groups.

Unlike the t-test, the Wilcoxon rank-sum test does not assume a parametric distribution (the t-test assumed the normal distribution) for the data or statistics derived from the data (for samples containing 50 finite values or more, R will still use a normal approximation in the calculations though). It ranks the full combined data and then compares ranks between the 2 groups. If there was no difference in distribution between the 2 groups, then the ranks for the observations in each group should be similar. If not, then one group will have higher (or lower) ranks than the other.

In R you use the function wilcox.test() to perform a Wilcoxon rank-sum test. The use of this function is very similar to t.test().

Specifying 2 vectors as input arguments:

wilcox.test(x=vec1,y=vec2)

    Wilcoxon rank sum test with continuity correction

data:  vec1 and vec2
W = 841, p-value = 0.004572
alternative hypothesis: true location shift is not equal to 0

Using the formula syntax:

wilcox.test(Sepal.Width~Species,data=iris %>% filter(Species!="setosa"))

    Wilcoxon rank sum test with continuity correction

data:  Sepal.Width by Species
W = 841, p-value = 0.004572
alternative hypothesis: true location shift is not equal to 0

The p-value is 0.0046 which, while slightly different from the one from the t-test, is also less than 0.05. So we conclude there is enough evidence to reject the null hypothesis of equal sepal width distributions for the 2 species of flowers.

The function wilcox.test() can also be used for one-sample Wilcoxon tests (also called Wilcoxon signed-rank tests) as well as paired tests. To find out how to do this, start by reading the manual page for the function by typing ?wilcox.test at the console.
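Analogously, a one-sample Wilcoxon signed-rank test (again with a purely illustrative hypothesised median of 3) and a paired test could look as follows:

```r
# one-sample Wilcoxon signed-rank test: H0 is that the median of vec1 is 3
wilcox.test(x=vec1,mu=3)

# paired Wilcoxon signed-rank test
wilcox.test(x=iris$Petal.Length,y=iris$Petal.Width,paired=TRUE)
```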

One-way ANOVA

ANOVA is an acronym for analysis of variance. It generalises the t-test to more than 2 groups. It is a parametric test, assuming normality of sample means, and the test statistic follows the F distribution, hence one-way ANOVA is an F-test.

Specifically, ANOVA tests the null hypothesis that all means \(\mu_1,...,\mu_k\) of \(k\geq2\) groups are equal against the alternative hypothesis that at least one of them differs. If the null hypothesis is rejected, the test will not tell you why it is rejected: it could be that a single mean in one group is different from the others or that no 2 means are the same – you will have to drill down after a null hypothesis rejection.

The test is based on comparing the variance (or sum of squares) between groups to that within groups, hence the name of the test.

One-way ANOVA is really just a linear regression for a continuous response variable and a single categorical predictor variable with \(k\) levels.

In R, you can use the functions aov() and anova() to do a one-way ANOVA test. In both cases, you will need to use formula syntax.

As an example we can compare the sample means for Sepal.Width between all 3 species of flowers in the iris dataset.

# using aov(); to get the p-value, you need to use summary() on the output from aov()
summary(aov(Sepal.Width~Species,data=iris))
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  11.35   5.672   49.16 <2e-16 ***
Residuals   147  16.96   0.115                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# using anova(); you first need to fit a linear regression model for Sepal.Width and Species
anova(lm(Sepal.Width~Species,data=iris))
Analysis of Variance Table

Response: Sepal.Width
           Df Sum Sq Mean Sq F value    Pr(>F)    
Species     2 11.345  5.6725   49.16 < 2.2e-16 ***
Residuals 147 16.962  0.1154                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is almost 0: it is less than \(2.2\times10^{-16}\) which is the limit of the numerical precision on a standard computer – the null hypothesis of equal means is rejected.
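To drill down after rejecting the null hypothesis, one common option (sketched here; other post-hoc procedures exist) is Tukey's honest significant differences, which compares all pairs of group means with an adjustment for multiple testing:

```r
fit<-aov(Sepal.Width~Species,data=iris)
TukeyHSD(fit) # pairwise differences in group means, with adjusted p-values
```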

Note that for \(k=2\), one-way ANOVA is equivalent to the two-sample t-test with equal variances assumed (t.test() with var.equal=TRUE); the F statistic is then the square of the t statistic:

summary(aov(Sepal.Width~Species,data=iris %>% filter(Species!="setosa")))
            Df Sum Sq Mean Sq F value  Pr(>F)   
Species      1  1.040  1.0404   10.28 0.00182 **
Residuals   98  9.921  0.1012                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Kruskal-Wallis test

The Kruskal-Wallis test is the non-parametric equivalent to one-way ANOVA. It does not assume normal distribution of sample means and is based on ranks of all observations similar to the Wilcoxon rank-sum test. In R you use the function kruskal.test().

The input can either be specified using the formula syntax or as a list object with each group being represented by a vector of observations.

# using a list of vectors
vec1<-iris$Sepal.Width[iris$Species=="versicolor"]
vec2<-iris$Sepal.Width[iris$Species=="virginica"]
vec3<-iris$Sepal.Width[iris$Species=="setosa"]
kruskal.test(list(vec1,vec2,vec3))

    Kruskal-Wallis rank sum test

data:  list(vec1, vec2, vec3)
Kruskal-Wallis chi-squared = 63.571, df = 2, p-value = 1.569e-14
# using formula syntax
kruskal.test(Sepal.Width~Species,data=iris)

    Kruskal-Wallis rank sum test

data:  Sepal.Width by Species
Kruskal-Wallis chi-squared = 63.571, df = 2, p-value = 1.569e-14

The p-value is again extremely low, though just above the numerical precision limit. The conclusion is the same: the null hypothesis of equal distribution of Sepal.Width across the 3 flower species is rejected.

Two-sample test for proportions

When comparing a binary variable between 2 groups, it is best to use a test designed for comparing proportions. In R you can use the function prop.test() for this. It tests the null hypothesis of equal proportions in the 2 groups against the alternative that the proportions differ, using a normal approximation.

As an alternative, you can arrange the data in a 2x2 table and then test the null hypothesis of independence of the row and the column variable against the alternative that they are not independent. For this you can use either Fisher's exact test (fisher.test() in R) or an approximate Chi-squared test (chisq.test() in R). The chi-squared test is identical to using prop.test().

As an example, let’s generate a dataset consisting of 2 treatment groups (A, 30 observations, and B, 70 observations) and an outcome variable indicating whether the treatment was successful, assuming that treatment A has a true success rate of 40% and B has a true success rate of 25%.

dat<-data.frame(
  group=c(rep("A",30),rep("B",70)),
  cured=0
)
dat$cured[dat$group=="A"]<-sample(x=0:1,size=sum(dat$group=="A"),replace=T,prob=c(0.6,0.4))
dat$cured[dat$group=="B"]<-sample(x=0:1,size=sum(dat$group=="B"),replace=T,prob=c(0.75,0.25))
prop.test(x=c(sum(dat$cured[dat$group=="A"]),sum(dat$cured[dat$group=="B"])),n=c(sum(dat$group=="A"),sum(dat$group=="B")))

    2-sample test for equality of proportions with continuity correction

data:  c(sum(dat$cured[dat$group == "A"]), sum(dat$cured[dat$group == "B"])) out of c(sum(dat$group == "A"), sum(dat$group == "B"))
X-squared = 1.0417, df = 1, p-value = 0.3074
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.09956594  0.34718499
sample estimates:
   prop 1    prop 2 
0.3666667 0.2428571 
tbl2x2<-table(dat$group,dat$cured)
chisq.test(tbl2x2)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl2x2
X-squared = 1.0417, df = 1, p-value = 0.3074
fisher.test(tbl2x2)

    Fisher's Exact Test for Count Data

data:  tbl2x2
p-value = 0.2304
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.2017181 1.5673380
sample estimates:
odds ratio 
 0.5574867 

Linear regression

The simplest case for a regression model is the one where we have 2 continuous variables: a response or dependent variable y and a predictor or independent variable x. In the linear regression problem, we then try to find parameters \(\beta_0\) and \(\beta_1\) such that we can express every observation \(\{x_i,y_i\},\;i=1,\ldots,n\) in a dataset of \(n\) observations as

\[ y_i=\beta_0+\beta_1x_i+\epsilon_i \] where \(\epsilon_i\) are error or residual terms.

The technical aspects revolve around how to estimate \(\beta_0\) and \(\beta_1\) such that we end up with optimal parameter values for them. There are 2 main ways to go about this:

  1. Try to minimise the total error, usually expressed as a sum of squares: \(\sum_i \epsilon_i^2\). This is called least squares estimation.
  2. Maximise the likelihood of observing the data that was observed assuming that the model \(E[Y]=\beta_0+\beta_1X\) generated the data. This is called maximum likelihood estimation.

The first method does not make any assumptions about the distribution of the error \(\epsilon_i\) and is purely a way of optimising the fit of the regression line. Visually, we aim to minimise the sum of the squared lengths of the red segments in the figure below.

df<-tibble(
  x=runif(25,min=-5,max=5),
  y=1.5*x+rnorm(25,sd=2)+3.5
)

ggplot(data=df,aes(x=x,y=y)) + 
  geom_abline(intercept=3,slope=1,colour="steelblue",lwd=2) +
  geom_segment(aes(x=x,xend=x,y=y,yend=x+3),colour="red",lwd=1.5) +
  geom_point(size=4) +
  theme(text = element_text(size=20)) 

The second method requires assuming a probability distribution for the random variable Y, given the observed values of X (in other words, this involves assuming a probability distribution for the error terms \(\epsilon_i\)).

In the most common situation, one assumes that the errors are normally distributed with mean 0 and in this case, the 2 methods give the same results.
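To make the least squares method concrete: for simple linear regression the estimates have a closed form, \(\hat{\beta}_1=\sum_i(x_i-\bar{x})(y_i-\bar{y})/\sum_i(x_i-\bar{x})^2\) and \(\hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}\). A minimal sketch (regenerating data like the df above; the seed is arbitrary) computing these by hand and checking them against lm():

```r
set.seed(1) # arbitrary seed, for reproducibility
x <- runif(25, min = -5, max = 5)
y <- 1.5 * x + rnorm(25, sd = 2) + 3.5

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

c(b0, b1)       # closed-form least squares estimates
coef(lm(y ~ x)) # same values from lm()
```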

In R you can use the function lm() for a least squares fit and the function glm() for a maximum likelihood fit. Unless you specify another distribution family in glm(), both functions will give the same results. The model is specified using the formula syntax we saw when we looked at the t-test.

We can use the df data frame we generated above for the least squares figure for an example.

modLm<-lm(y~x,data=df)
summary(modLm)

Call:
lm(formula = y ~ x, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4229 -1.9234  0.3228  1.4761  4.2588 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7358     0.4230   8.832 7.54e-09 ***
x             1.5114     0.1373  11.008 1.21e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.102 on 23 degrees of freedom
Multiple R-squared:  0.8405,    Adjusted R-squared:  0.8335 
F-statistic: 121.2 on 1 and 23 DF,  p-value: 1.211e-10
modGlm<-glm(y~x,data=df)
summary(modGlm)

Call:
glm(formula = y ~ x, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.4229  -1.9234   0.3228   1.4761   4.2588  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7358     0.4230   8.832 7.54e-09 ***
x             1.5114     0.1373  11.008 1.21e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 4.4194)

    Null deviance: 637.22  on 24  degrees of freedom
Residual deviance: 101.65  on 23  degrees of freedom
AIC: 112.01

Number of Fisher Scoring iterations: 2

The 2 parameters \(\beta_0\) and \(\beta_1\) have a specific interpretation:

  • \(\beta_0\) is where the line intersects the y-axis. It is called the intercept and is the expected value of \(Y\) if \(X\) is set to 0.
  • \(\beta_1\) determines the direction and steepness of the regression line. It is called the slope and is the average increase in \(Y\) if \(X\) is increased by 1 unit.

\(\beta_0\) and \(\beta_1\) are the true, unknown parameter values. For any dataset we can only calculate estimates of them: \(\hat{\beta}_0\) and \(\hat{\beta}_1\). The object of statistical inference is usually \(\beta_1\), the slope parameter: if it is 0, then there is no (linear) relationship between the 2 variables. As such, one often wants to test the null hypothesis that the slope parameter is 0 against the alternative that it is not 0.

The p-value from this test is what R reports in the column Pr(>|t|) in the “Coefficients” section of the screen output from lm() and glm(). In the above example, this p-value is extremely low (essentially 0) and so, at the 5% significance level, we conclude that in this example, there is sufficient evidence to reject the null hypothesis that the slope is 0, i.e. there is enough evidence to reject the null hypothesis that \(Y\) and \(X\) are not linearly related.
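A quick way to inspect the estimated coefficients and their 95% confidence intervals directly from the modLm object fitted above (both functions are standard in base R's stats package):

```r
coef(modLm)    # estimated intercept (beta0 hat) and slope (beta1 hat)
confint(modLm) # 95% confidence intervals for both parameters
```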

You can specify more than one predictor / independent variable: simply use the + sign on the right-hand side of ~:

# generating another variable
df<-df %>%
  mutate(z=rnorm(nrow(df)))

# fitting a model with 2 predictor variables
modGlm2<-glm(y~x+z,data=df)
summary(modGlm2)

Call:
glm(formula = y ~ x + z, data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.3813  -1.8684   0.2985   1.3676   4.2930  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.7576     0.4426   8.489 2.17e-08 ***
x             1.5145     0.1409  10.750 3.19e-10 ***
z             0.1070     0.4745   0.226    0.824    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 4.609621)

    Null deviance: 637.22  on 24  degrees of freedom
Residual deviance: 101.41  on 22  degrees of freedom
AIC: 113.95

Number of Fisher Scoring iterations: 2